1. INTRODUCTION

Big data refers to a large amount of data created and diffused daily. Big data has a great influence, 
both commercial and economic, on the development of the global economy [1]. This huge amount of data 
constitutes a great source of power. It is an inexhaustible mine of knowledge that must be processed to 
extract valuable information. Thus, big data analytics have attracted many specialized companies and 
researchers who tried to improve and adapt the classical algorithms to handle voluminous data [2]. 

Hidden Markov models (HMMs) [3] are classical statistical models, widely used in many fields
such as speech recognition [4], finance [5] or bioinformatics [6], but in a big data context, these models have
not yet reached maturity. In previous approaches, the volume of data was not a real
problem since, in general, very large datasets were not handled. In addition, the applications used
structured data, so their formats were not very varied, which is not the case for big data. Also, algorithms
were not fast enough to give solutions in real time or to keep up with the high speed of data generation and
diffusion [7].

Big data in the context of HMM applications can be tackled using different approaches:
studying HMMs with multiple sequences, HMMs with long observation sequences, or HMMs with a large
number of hidden or observed states. In recent years, various works have focused on how to introduce HMMs
in big data applications in order to make full use of their potential. Thus, many researchers are working on
improving algorithms so that they take into account the complex characteristics of big data [8]. In fact, HMM
algorithms must be adapted to meet the growing demand for data processing. One of the most promising
solutions is to implement these algorithms under a big data framework in order to take advantage of its powerful
tools, which facilitate data distribution and parallel computation.

The decoding problem is one of the fundamental problems of HMMs. In this problem, given a
model $\lambda$, we search for the most probable state sequence that produced a given observation sequence
$O = \{o_1, o_2, \ldots, o_T\}$. To solve this problem, the Viterbi [9] and posterior decoding [10], [11] algorithms
are two of the most used. Although these two algorithms solve a similar problem, the Viterbi algorithm finds the
globally most probable state sequence, while the posterior decoding algorithm finds the most likely hidden state
at each position. Although the posterior decoding algorithm has shown its processing speed, efficiency and
accuracy, it generally has some drawbacks when handling big data, specifically its suboptimal complexity and
high execution time [12]. With the exponential development of big data technologies, it is necessary to focus on
new approaches that use these technologies to improve classical algorithms in terms of analysis and processing
power, mainly through parallel distributed computing [13]. Spark is certainly one of the most powerful big data
technologies; it has demonstrated its effectiveness in several applications and is attracting more and more
researchers.

In this paper, we present a new parallel distributed version of the posterior decoding algorithm under
Spark [14] for the HMM decoding problem. We used the main concepts of the Spark framework to achieve this
implementation: to distribute the data over many blocks, we used resilient distributed datasets
(RDDs) [15]; for the parallel computation, we used the MapReduce paradigm [16]; and to reduce
the communication cost, we used broadcast variables. One of the major advantages of the proposed solution
is that it benefits from the richness of Spark's modules, which offer a variety of tools for data collection and
preprocessing, a set of optimized algorithms for parallel calculation, algorithms for analyzing data in real
time, and the possibility of executing graph algorithms. Through this implementation under Spark in
a cloud environment, we aim to help bring hidden Markov models into the new era of big data,
which opens the door to their use in various fields of application requiring a huge
amount of data and parallel processing. The main contributions of this paper are summarized as follows:

— Reviewing the foundations of HMMs, mainly the decoding problem. 

— Proposing an improved posterior decoding algorithm, based on parallel distributed computing approach 
using Spark. 

— Evaluating the proposed approach in a cloud environment using several metrics. 

The remainder of this paper is organized as follows. Section 2 deals with the fundamentals of hidden Markov
models and presents the HMM decoding problem, followed by a detailed discussion of the posterior
decoding algorithm. In Section 3, we explore some related works. In Section 4, we describe the proposed
parallel distributed posterior decoding algorithm under Spark. The experimental results of the evaluation of the
proposed algorithm are presented and discussed in Section 5. Finally, in Section 6, we conclude the paper with a
summary of our key contributions and discuss possible future work.


2. RESEARCH BACKGROUND 
2.1. Hidden Markov models fundamentals 

Hidden Markov models are based on a first-order Markov model simulating the evolution of the state
of the system. An HMM involves two sequences of random variables: a hidden sequence and an observable
sequence. The hidden sequence corresponds to the sequence of states and the observable sequence
corresponds to the sequence of observations [3]. HMMs are statistical Markov models used in various fields,
especially in speech recognition and in signal processing and communications. Hidden Markov models are
also used in computational biology and bioinformatics [6], in natural language modelling [17], as well as in
financial analysis [5] and many other areas.

The characteristics of an HMM are defined as follows [3], [18]:
$\lambda = (A, B, \Pi)$: the parameter set of the model.
$N$: the number of hidden states in the model.
$S = \{s_1, s_2, \ldots, s_N\}$: the set of $N$ states. The state of the HMM at time $t$ is denoted $q_t$.
$M$: the number of observable symbols in each hidden state.
$V = \{v_1, v_2, \ldots, v_M\}$: the set of possible observations (the alphabet). $o_t \in V$ is the symbol observed
at time $t$.
$O = \{o_1, o_2, \ldots, o_T\}$: the vector of $T$ produced observations.
$Q = \{q_1, q_2, \ldots, q_T\}$, $q_t \in S$: the state sequence that produced an observation sequence.
$a_{ij}$: the transition probability, i.e. the probability that the model evolves from state $s_i$ to state $s_j$,
where:

$$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad \forall i, j \in [1..N] \text{ and } \forall t \in [1..T] \tag{1}$$

We denote $[A] = \{a_{ij}\}$ as the transition probabilities matrix.
$b_j(v_k)$: the observation symbol probability in hidden state $s_j$, where:

$$b_j(v_k) = P(o_t = v_k \mid q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M \tag{2}$$

We denote $[B] = \{b_j(v_k)\}$ as the observation probabilities matrix.
$\pi_i$: the initial probability of state $s_i$, where:

$$\pi_i = P(q_1 = s_i), \quad 1 \le i \le N \tag{3}$$

We denote $[\Pi] = \{\pi_i\}$ as the initial probabilities vector.
$P(O \mid \lambda)$: the probability that the HMM $\lambda$ has produced the sequence $O$.
HMMs are used to solve three main problems:

— Evaluation: Given the sequence of observations $O$ and an HMM $\lambda$, how do we assess the probability of
the observation, $P(O \mid \lambda)$? For this problem, a forward-backward dynamic programming procedure [19] is
used to calculate the probability of the observation sequence efficiently.

— Finding the most likely path: Given the sequence of observations $O$ and an HMM $\lambda$, how do we find a
sequence of states $Q$ that maximizes the probability of observing the sequence? The Viterbi algorithm is
a dynamic programming technique for finding this single best state sequence $Q = \{q_1, q_2, q_3, \ldots, q_T\}$ for
the given observation sequence $O = \{o_1, o_2, o_3, \ldots, o_T\}$. Another algorithm used to solve the decoding
problem is the posterior decoding algorithm, which is used when several paths have similar probabilities.

— Learning: How do we adjust the parameters $(A, B, \Pi)$ of an HMM $\lambda$ to maximize $P(O \mid \lambda)$? This is
done using the Expectation-Maximization (EM) algorithm [20].


2.2. Posterior decoding algorithm 

In the hidden Markov model decoding problem, given the sequence of observations
$O = \{o_1, o_2, o_3, \ldots, o_T\}$ and an HMM $\lambda$, we seek the most probable sequence of states $Q$; that is,
we try to guess the correct hidden sequence of states. Two algorithms are most commonly used to solve this
problem: the Viterbi and posterior decoding algorithms. The definition of the most probable state sequence
differs depending on the domain and may influence the final solution of the problem. A first approach searches
for the most probable state $q_t$ at each time $t$ and concatenates all such $q_t$. This means that we choose the
states that are individually most likely at the time when a symbol is emitted. This approach is called posterior
decoding. Another approach is to find the single best path through the hidden state space, which is what the
Viterbi algorithm does.

While the Viterbi algorithm remains the most used and efficient algorithm for the problem of 
decoding HMMs, in some applications it is not the most appropriate. One of the alternatives of this algorithm 
is the posterior decoding algorithm which is also widely used when there are many paths that have almost the 
same probability as the most likely. Posterior decoding algorithm provides the most likely state at any time. It 
focuses on the individual positions in the sequence and seeks to maximize the probability that they are well 
explained. 
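The difference between the two criteria can be illustrated on a very small model by enumerating every state path. The sketch below is only illustrative: the parameter values `PI`, `A`, `B` and the function names are hypothetical and do not come from this paper, and brute-force enumeration is used in place of the Viterbi and forward-backward recursions, so it is only usable for short sequences.

```python
from itertools import product

# Hypothetical toy model (illustrative values only, not from the paper).
PI = [0.5, 0.5]                   # initial probabilities pi_i
A = [[0.5, 0.5], [0.4, 0.6]]      # transition probabilities a_ij
B = [[0.9, 0.1], [0.2, 0.8]]      # observation probabilities b_j(v_k)


def joint(path, obs):
    """Joint probability P(Q, O | model) of one state path Q."""
    p = PI[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


def decode_by_enumeration(obs, n_states=2):
    """Return (best_path, posterior_path) by enumerating every path."""
    paths = list(product(range(n_states), repeat=len(obs)))
    probs = {q: joint(q, obs) for q in paths}
    # Global criterion (what Viterbi computes): the single best path.
    best_path = max(paths, key=probs.get)
    # Local criterion (posterior decoding): best state at each position,
    # using the unnormalized marginal over all paths through that state.
    posterior_path = []
    for t in range(len(obs)):
        marginal = [sum(p for q, p in probs.items() if q[t] == s)
                    for s in range(n_states)]
        posterior_path.append(marginal.index(max(marginal)))
    return best_path, tuple(posterior_path)
```

By construction, the joint probability of `best_path` is never lower than that of `posterior_path`, while `posterior_path` maximizes each position separately; the two paths may or may not coincide for a given model.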

The posterior decoding algorithm involves dynamic programming using the forward and backward
algorithms, which use sums instead of maximizations to calculate the total probability over all
possible paths. In the forward algorithm, we define the forward variable $\alpha_t(i)$ as the joint probability of
the partial observation sequence $o_1, o_2, o_3, \ldots, o_t$ (up to time $t$) and of being in state $s_i$ at time
$t$, given the model $\lambda$:

$$\alpha_t(i) = P(o_1 o_2 o_3 \ldots o_t, q_t = s_i \mid \lambda) \tag{4}$$

The forward algorithm to calculate $P(O \mid \lambda)$, the probability of the observation sequence
$o_1, o_2, o_3, \ldots, o_T$ given the model $\lambda$, is as follows:

Initialization

$$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{5}$$

Recurrence

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1}), \quad 1 \le t \le T-1 \text{ and } 1 \le j \le N \tag{6}$$

Termination ($t = T$)

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{7}$$


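The backward variable $\beta_t(i)$ is calculated similarly to $\alpha_t(i)$ using a backward recursion, given that
we are starting from state $s_i$ at time $t$. Hence, we define the backward variable as:

$$\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = s_i, \lambda) \tag{8}$$

The backward algorithm is as follows:

Initialization

$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{9}$$

Recurrence

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1,\ 1 \le i \le N \tag{10}$$

Termination ($t = 1$)

$$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) \tag{11}$$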


In posterior decoding, for each $t$, $1 \le t \le T$, we find the state $q_t$ that maximizes $P(q_t \mid O, \lambda)$.

Let $\gamma_t(i)$ be the probability of being in state $s_i$ at time $t$ given the observation sequence $O$
and the model $\lambda$ (posterior probability). Thus, at each time, we can choose the optimal state $q_t$ that
maximizes $\gamma_t(i)$.

$$\gamma_t(i) = P(q_t = s_i \mid O, \lambda) \tag{12}$$

$$= \frac{P(q_t = s_i, O \mid \lambda)}{P(O \mid \lambda)} \tag{13}$$

$$= \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \tag{14}$$

with the following constraint being satisfied:

$$\sum_{i=1}^{N} \gamma_t(i) = 1 \tag{15}$$

The individually most likely state $\hat{q}_t$ (the sequence of these states is the one obtained by posterior
decoding) is defined as:

$$\hat{q}_t = \arg\max_{1 \le i \le N} \gamma_t(i), \quad 1 \le t \le T \tag{16}$$


In other words, at every position we choose the most probable state for that position. 
The pseudo code of posterior decoding algorithm is given by Algorithm 1. 


Algorithm 1: Classical posterior decoding algorithm
input: A model $\lambda = (A, B, \Pi)$ and a sequence of observations $O = \{o_1 o_2 o_3 \ldots o_T\}$
output: $\{\hat{q}_1, \hat{q}_2, \hat{q}_3, \ldots, \hat{q}_T\}$

begin
  forward variable calculation
  Initialization
    for i = 1 to N do
      $\alpha_1(i) \leftarrow \pi_i\, b_i(o_1)$
    end for
  Recurrence
    for t = 1 to T-1 do
      for j = 1 to N do
        $\alpha_{t+1}(j) \leftarrow \big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\big]\, b_j(o_{t+1})$
      end for
    end for
  Termination
    $P(O \mid \lambda) \leftarrow \sum_{i=1}^{N} \alpha_T(i)$

  backward variable calculation
  Initialization
    for i = 1 to N do
      $\beta_T(i) \leftarrow 1$
    end for
  Recurrence
    for t = T-1 downto 1 do
      for i = 1 to N do
        $\beta_t(i) \leftarrow \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
      end for
    end for

  $\gamma_t(i)$ calculation
    for t = 1 to T do
      for i = 1 to N do
        $\gamma_t(i) \leftarrow \alpha_t(i)\, \beta_t(i) / P(O \mid \lambda)$
      end for
    end for

  individually most likely state calculation
    for t = 1 to T do
      $\hat{q}_t \leftarrow \arg\max_{1 \le i \le N} \gamma_t(i)$
    end for
end
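For readers who prefer code to pseudocode, Algorithm 1 can be sketched in plain Python as follows. This is a minimal sequential sketch, not the implementation evaluated in this paper: the function name `posterior_decode` and the list-of-lists representation are assumptions, indices are 0-based, and no scaling is applied, so the probabilities may underflow for long observation sequences.

```python
def posterior_decode(pi, a, b, obs):
    """Classical posterior decoding; returns (states, gamma, p_obs).

    pi[i], a[i][j] and b[j][k] follow the notation of Section 2.1 with
    0-based indices; obs is a list of observation symbol indices.
    """
    n, length = len(pi), len(obs)

    # Forward variable: initialization (5) and recurrence (6).
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(length - 1):
        alpha.append([
            sum(alpha[t][i] * a[i][j] for i in range(n)) * b[j][obs[t + 1]]
            for j in range(n)
        ])
    p_obs = sum(alpha[-1])  # termination (7)

    # Backward variable: initialization (9) and recurrence (10).
    beta = [[1.0] * n for _ in range(length)]
    for t in range(length - 2, -1, -1):
        beta[t] = [
            sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]

    # Posterior probabilities and individually most likely states (16).
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(length)]
    states = [row.index(max(row)) for row in gamma]
    return states, gamma, p_obs
```

Each row of `gamma` should sum to 1, which gives a quick sanity check of constraint (15) on any input.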


3. RELATED WORK 

In [21], to improve the prediction of the topology of fully beta membrane proteins, Fariselli et al.
propose a new algorithm for the HMM decoding problem. This new algorithm, called Posterior-Viterbi, is a
combination of the posterior and Viterbi algorithms. First, they compute the posterior probabilities of each
state, then they use the Viterbi algorithm to look for the best possible posterior path through the model. It
performs better than the other decoding algorithms, especially when several competing paths are present. This
algorithm is certainly effective, but in terms of time complexity it is slower than other decoding algorithms
(e.g., Viterbi, posterior decoding), while in terms of space complexity it has the same memory requirements as
Viterbi and posterior decoding.

In [22], Sand et al. used new generations of multi-core processors that support the SSE instruction
set to develop a C++ library for HMMs. It exploits an optimized implementation of the forward and
backward algorithms by reformulating matrix multiplications and, for the division
operation in each iteration, it uses an SSE instruction instead of the instruction for chunk multiplication to
speed up the calculation. Lunter et al. [23] propose a variant of posterior decoding, marginalized posterior decoding,
which differs from the classical algorithm in the way the intervals are treated. It takes into account the 
columns which contribute to an alignment to calculate this alignment that maximizes the posterior probability 
of the cumulative log of these columns. 

Do et al. [24] present the ProbCons algorithm, which uses pair HMMs to estimate posterior probabilities
for amino acid residues. It uses an alignment partition function to generate suboptimal alignments. It differs
from other approaches in its use of maximum expected accuracy to align pairs of sequence profiles. To
predict sequence features by combining the feature probabilities of homologous sequences, Käll et al. [25]
propose a posterior HMM decoder. This algorithm considers the mean posterior label probability of each
position in a global sequence alignment. Bourlard et al. [26] improve the posterior probabilities using all
possible acoustic information and prior knowledge to enhance the functioning of automatic speech
recognition systems. The objective of this work is to improve the estimation of local posteriors by calculating
the posterior probability recursively, so that the local posteriors take into account all available acoustic
information together with other prior information. Brown et al. [27] outline a new HMM decoding approach based on
labelling sequences in such a way that the correct labelling of a sequence is close to the prediction.

We propose an improvement of the posterior decoding algorithm: a parallel distributed posterior
decoding algorithm under Spark, which makes it possible to speed up the algorithm for a high number of
states or a high number of sequences. Thus, the improved algorithm reduces both the time
complexity and the computation time. This solution is therefore well adapted to big data applications (high
scalability, effective management of heterogeneous data and easy integration in big data frameworks).

In addition, the proposed Spark-based solution benefits from the richness of Spark's modules,
which offer a variety of tools for data collection, preprocessing and cleaning, a set of optimized algorithms
for parallel calculation, and tools for analyzing and managing data in real time. It also makes it possible to
execute graph algorithms.


4. PARALLEL DISTRIBUTED POSTERIOR DECODING ALGORITHM USING SPARK 

Many recent studies have focused on the parallel distributed implementation of classical
algorithms using big data platforms, including the classical algorithms of HMMs [28]-[33]. We implemented
our algorithm under Spark using the Python language. We used the main concepts of the Spark framework to
achieve this implementation: the MapReduce paradigm to perform parallel computations, the resilient
distributed datasets (RDDs) concept to distribute the data over many blocks, and broadcast variables to reduce
the communication cost.

Spark is one of the platforms often used for big data processing to handle a huge amount of data in 
batch and real time processing modes. Spark uses RDDs to enable efficient reuse of data in a broad family of 
applications. RDDs are characterized by their fault tolerance property and allow the storage of intermediate 
data in memory using parallel data structures, the control of partitioning, and the manipulation of data using a 
set of operators. RDDs support two types of operations: transformations and actions. The transformations 
(e.g., map, filter, sample) return a new RDD while the actions, like reduce, collect, and count, evaluate and 
return a new value. 

Spark, like Hadoop [34], relies on a distributed storage system (e.g., HDFS [35]) to store
the input and output data of Spark jobs [36]. Spark is based on the following elements: Spark core,
which is the framework execution engine; the Spark cluster manager, which manages the cluster resources
(Kubernetes, Mesos [37], Yarn [38]); Spark SQL [39]; Spark streaming [40]; MLlib, the distributed machine
learning library [41]; and GraphX [42], as shown in Figure 1.


[Figure: layered diagram. Top layer: Spark SQL + DataFrames (structured), Spark Streaming (real-time),
MLlib (machine learning), GraphX (graph computation). Middle layer: Spark Core API (RDDs).
Bottom layer: YARN, Mesos, Standalone Scheduler.]

Figure 1. Spark architecture


In addition, to reduce the communication cost, we used a fundamental concept of Spark:
broadcast variables. A Spark action is executed through a set of stages, which are separated by
distributed shuffle operations. Spark automatically broadcasts and caches the common data
needed by the tasks within each stage.

To optimize the calculations on Spark, we used vectors. Each column of the matrices is stored in a
vector. Vectors are less costly in terms of computation time [43], so it is better to work on vectors rather
than on matrices. Indeed, even if filling the vectors requires more operations than filling a
matrix, the program will be faster. The explanation is as follows: when filling a vector, starting from the first
component, the processor automatically allocates cache memory for the following n components, whereas
when filling a matrix, only the components of the first line are allocated a place in the cache memory, and going
to the next line resets the operation. So, for each matrix, we use a vector to store the elements of each
column. For example, when calculating the posterior probability $\gamma_t(i)$, the values $\gamma_1(1), \gamma_1(2),
\gamma_1(3), \ldots, \gamma_1(N)$ are stored in the vector Gamma_1, as shown in Figure 2.

The steps of the proposed algorithm implemented under Spark are as follows. We first calculate the values of
the forward and backward variables as explained in Section 2 by parallelizing the loops on i and j. Under
Spark, this task is performed using multiple executors in parallel. Then, from the values stored in the Alpha_t
and Beta_t vectors, we compute the $\gamma_t(i)$ in parallel for each $t$, as shown in Figure 3. To do so, we use
different RDD operators (map, reduce, ...) to perform the calculations efficiently in parallel. For example,
$P(O \mid \lambda)$ is calculated with the reduce function, which aggregates the elements of an RDD by applying a
commutative and associative function passed as an argument, instead of explicitly summing the elements
$\alpha_T(i)$ for $i$ varying between 1 and $N$. According to Figure 3, for a given value of $t$, the $N$ values of
$\gamma_t(i)$ are calculated in a single parallel step rather than $N$ times in sequence. This is a gain of
T * N operations, which is not negligible for large programs. The $\gamma_t(i)$ for
each $t$ are stored in the vector Gamma_t. Then, we apply the argmax function to the $\gamma_t(i)$ over all $N$
states to find the individually most likely state. The proposed parallel distributed posterior decoding
algorithm using Spark is presented in Algorithm 2.
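The map and reduce steps described above can be sketched locally with Python's built-in `map` and `functools.reduce`, used here only as stand-ins for the RDD operators of the same name. The function names `forward_step` and `observation_probability` are hypothetical, and nothing in this sketch is distributed; it only shows how one forward recurrence step and the termination sum decompose into a map followed by a commutative, associative reduce.

```python
from functools import reduce


def forward_step(alpha_t, a, b, next_symbol):
    """One forward recurrence step written as a map followed by a reduce."""
    n = len(alpha_t)
    alpha_next = []
    for j in range(n):
        # map: one product alpha_t(i) * a_ij per source state i
        products = map(lambda i: alpha_t[i] * a[i][j], range(n))
        # reduce: commutative and associative sum of the mapped products
        total = reduce(lambda x, y: x + y, products)
        alpha_next.append(total * b[j][next_symbol])
    return alpha_next


def observation_probability(alpha_last):
    """P(O | model) obtained by a reduce over the last forward vector."""
    return reduce(lambda x, y: x + y, alpha_last)
```

Because addition is commutative and associative, the partial sums can be combined in any order, which is the property a distributed reduce relies on.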


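         i=1             i=2             i=3             ...   i=N
t=1      $\gamma_1(1)$   $\gamma_1(2)$   $\gamma_1(3)$   ...   $\gamma_1(N)$     Vector Gamma_1
t=2      $\gamma_2(1)$   $\gamma_2(2)$   $\gamma_2(3)$   ...   $\gamma_2(N)$     Vector Gamma_2
...      ...             ...             ...             ...   ...
t=T      $\gamma_T(1)$   $\gamma_T(2)$   $\gamma_T(3)$   ...   $\gamma_T(N)$     Vector Gamma_T


Figure 2. Calculation and storage of $\gamma_t(i)$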


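[Figure: for each time step, the $N$ posterior probabilities are computed in parallel.
t = 1: $\gamma_1(1) = \frac{\alpha_1(1)\beta_1(1)}{P(O \mid \lambda)}$, $\gamma_1(2) = \frac{\alpha_1(2)\beta_1(2)}{P(O \mid \lambda)}$, $\gamma_1(3) = \frac{\alpha_1(3)\beta_1(3)}{P(O \mid \lambda)}$, ..., $\gamma_1(N) = \frac{\alpha_1(N)\beta_1(N)}{P(O \mid \lambda)}$
...
t = T: $\gamma_T(1) = \frac{\alpha_T(1)\beta_T(1)}{P(O \mid \lambda)}$, $\gamma_T(2) = \frac{\alpha_T(2)\beta_T(2)}{P(O \mid \lambda)}$, $\gamma_T(3) = \frac{\alpha_T(3)\beta_T(3)}{P(O \mid \lambda)}$, ..., $\gamma_T(N) = \frac{\alpha_T(N)\beta_T(N)}{P(O \mid \lambda)}$]


Figure 3. Parallel calculation of the values of posterior probability $\gamma_t(i)$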


Algorithm 2: Parallel distributed posterior decoding algorithm under Spark
input: A model $\lambda = (A, B, \Pi)$, a sequence of observations $O = \{o_1 o_2 o_3 \ldots o_T\}$
output: The individually most likely states $\hat{q}$

begin
  for each executor_j of N executors do
    Parallel do
      $\alpha_1(j) \leftarrow \pi_j\, b_j(o_1)$   {j ∈ {1, 2, 3, ..., N}}
  end for
  for t ← 1 to T-1 do
    for each executor_ij of N*N executors do
      Parallel do
        calculate (map) $\alpha_t(i)\, a_{ij}$ and store $\alpha_t(i)$ in Alpha_t, i, j ∈ {1, ..., N}
        sum (reduce) of $\alpha_t(i)\, a_{ij}$, then multiply by $b_j(o_{t+1})$, i, j ∈ {1, ..., N}
    end for
  end for
  $P(O \mid \lambda)$ ← Alpha_T.reduce(lambda a, b: a + b)
  for each executor_j of N executors do
    Parallel do
      $\beta_T(j) \leftarrow 1$   {j ∈ {1, 2, 3, ..., N}}
  end for
  for t ← T-1 downto 1 do
    for each executor_ij of N*N executors do
      Parallel do
        calculate $\beta_{t+1}(j)\, a_{ij}\, b_j(o_{t+1})$ and store $\beta_t(i)$ in Beta_t, i, j ∈ {1, ..., N}
    end for
  end for
  for each executor_ti of T*N executors do
    Parallel do
      calculate $\gamma_t(i) \leftarrow \alpha_t(i)\, \beta_t(i) / P(O \mid \lambda)$
      and store $\gamma_t(i)$ in Gamma_t, i ∈ {1, ..., N}, t ∈ {1, ..., T}
  end for
  for each executor_t of T executors do
    Parallel do
      $\hat{q}_t$ ← argmax(Gamma_t)
  end for
  return $\{\hat{q}_1, \hat{q}_2, \hat{q}_3, \ldots, \hat{q}_T\}$
end
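The last two stages of Algorithm 2, which are independent across time steps, can be sketched on a single machine as follows. This is only an illustration of the decomposition: a local thread pool from `concurrent.futures` stands in for the Spark executors, the function names and the `workers` parameter are hypothetical, and in CPython threads will not by themselves speed up this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def most_likely_state(task):
    """Compute Gamma_t for one time step t and return its argmax."""
    alpha_t, beta_t, p_obs = task
    gamma_t = [x * y / p_obs for x, y in zip(alpha_t, beta_t)]
    return gamma_t.index(max(gamma_t))


def parallel_argmax(alpha, beta, p_obs, workers=4):
    """Run the per-t posterior stage on a local thread pool.

    alpha[t] and beta[t] play the role of the Alpha_t and Beta_t vectors;
    the pool is a local stand-in for Spark executors, not a cluster.
    """
    tasks = [(alpha[t], beta[t], p_obs) for t in range(len(alpha))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the time steps
        return list(pool.map(most_likely_state, tasks))
```

Each task only needs its own Alpha_t and Beta_t vectors plus the scalar P(O|λ), which is the kind of small shared value that suits a broadcast variable.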


5. RESULTS AND DISCUSSION 
5.1. Experimentation setup 

We evaluated the new algorithm using Spark in a cloud environment. We used t2.large instances
of the Amazon EC2 cloud computing platform, which is characterized by a resizable computing capacity and a very
high level of security. We carried out the experiments with a configuration consisting of 8 GB of memory
and 2 CPUs, with Spark version 2.0.1 and 5 GB of Amazon S3 storage. T2 instances are burstable
performance instances that provide a high-frequency Intel Xeon processor with burstable CPU and offer a
good balance between computing, memory, and network resources.

In this study, we evaluate the proposed algorithm by performing different experiments using datasets
that consist of sequences of integers drawn from a multinomial distribution. In the first experiment, we
fixed the number of sequences and measured the running time as a function of the number of states; then we
fixed the number of states and measured the running time as a function of the number of sequences. To evaluate
the efficiency of this parallel distributed implementation, we also measured the speedup and the parallelization
efficiency of the proposed algorithm. For these last two measurements, we created four subsets of data with
different numbers of sequences, and we measured the speedup and then the efficiency while varying the number
of nodes used.


5.2. Computational complexity 

We compared the proposed parallel distributed posterior decoding algorithm to the classical one in
terms of time and space complexity. As shown in Table 1, the results indicate a great improvement in time
complexity compared to the classical version, while the space complexity remains almost the same. Thanks to
this implementation, the time complexity has been reduced from $O(N^2(T-1))$ to $O(T-1)$, with
$N$ the number of states and $T$ the length of the observation sequence.
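As a rough worked example of what this reduction means, take the hypothetical values $N = 100$ and $T = 1000$ (chosen only for illustration, not taken from the experiments). Assuming that $N^2$ executors are available and ignoring communication and reduction costs, the number of sequential steps in the recurrences compares as:

$$N^2 (T-1) = 100^2 \times 999 = 9\,990\,000 \quad \text{versus} \quad T - 1 = 999$$

Under these idealized assumptions the ratio is $N^2 = 10^4$; with fewer executors or non-negligible communication, the gain actually obtained would be smaller.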

The results in Table 2 match the results in Table 1. This table presents a step-by-step time
complexity comparison between the classical posterior decoding algorithm and the new parallel distributed
algorithm under Spark. For most stages of the algorithm (i.e., forward and backward variables, posterior
probability) there is a remarkable improvement. Overall, the two tables show that the proposed algorithm
considerably improves the time complexity compared to the conventional one.


Table 1. Complexity comparison

Complexity type     Classic posterior decoding    Parallel distributed posterior decoding
Time complexity     $O(N^2(T-1))$                 $O(T-1)$
Space complexity    $O(N^2(T-1))$                 $O(N^2(T-1))$

Table 2. Step-by-step complexity comparison of the classical algorithm and the parallel distributed algorithm under Spark

Step                                         Classic posterior decoding    Parallel distributed posterior decoding
Initialization (Forward variable)            $O(N)$                        $O(1)$
Recurrence (Forward variable)                $O(N^2(T-1))$                 $O(T-1)$
Termination (Forward variable)               $O(N)$                        $O(1)$
Initialization (Backward variable)           $O(N)$                        $O(1)$
Recurrence (Backward variable)               $O(N^2(T-1))$                 $O(T-1)$
Calculation of posterior probability         $O(N(T-1))$                   $O(1)$
Individually most likely state calculation   $O(N^2(T-1))$                 $O(1)$
5.3. Performance metrics
5.3.1. Running time analysis 

Figure 4 shows the performance of the proposed parallel distributed version of the posterior decoding
algorithm in terms of running time as a function of the number of states. The curve shows that, as the number
of states increases, the ratio between running time and number of states remains nearly stable. Figure 5 shows
the performance of the parallel distributed posterior decoding algorithm under Spark in terms of running time
as a function of the number of sequences. The curve shows that, as the number of sequences increases, the
proposed algorithm still presents good results in terms of running time. This indicates that the algorithm is
well suited to applications with a very large number of sequences.


5.3.2. Speedup analysis 

To measure the performance of parallel implementations, one of the frequently used metrics is
speedup. It measures the evolution of execution time as a function of the number of nodes. The speedup
is the benefit obtained by a parallel implementation of an algorithm (on p nodes) compared to the same
algorithm on a single node.
According to Amdahl's law, the speedup is calculated as:

$$S_p = \frac{T_1}{T_p} \tag{17}$$

where $T_1$ and $T_p$ are respectively the processing times on 1 and $p$ resources.
These resources can be cores in a processor, processors in a shared memory machine, nodes (PCs) in a cluster
or disks in a mass storage system. In our case, $T_1$ represents the execution time of the sequential algorithm
and $T_p$ the execution time of the parallel algorithm on $p$ nodes. As can be seen in Figure 6, the proposed
algorithm presents a significant improvement in execution time: the speedup increases with the number of
nodes used, while depending on the volume of data processed.

Figure 4. Parallel distributed posterior decoding algorithm running time as a function of states number

Figure 5. Parallel distributed posterior decoding algorithm running time as a function of sequences number

Figure 6. Speedup of parallel distributed posterior decoding algorithm as a function of nodes number


5.3.3. Parallelization efficiency analysis 
Efficiency is a profitability metric that quantifies how well the resources
involved in a parallelization are used. This measure is defined as:

$$E_p = \frac{S_p}{p} \tag{18}$$

where $S_p$ is the speedup on $p$ nodes, that is, $T_1 / T_p$ with $T_p$ the execution time of the parallel
algorithm on $p$ nodes.
According to Figure 7, the efficiency depends on the number of nodes used and on the volume of
the data. For the different subsets of data, a satisfactory efficiency rate has been obtained.
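The two metrics can be computed directly from measured running times, as in the short sketch below. The timings used are made-up placeholders for illustration only and are not the measurements reported in this paper; the function names are likewise hypothetical.

```python
def speedup(t_1, t_p):
    """Speedup S_p = T_1 / T_p, as in equation (17)."""
    return t_1 / t_p


def efficiency(t_1, t_p, p):
    """Efficiency E_p = S_p / p, as in equation (18)."""
    return speedup(t_1, t_p) / p


# Made-up timings in seconds, for illustration only (not measured results).
t_single = 120.0
timings = {3: 50.0, 6: 30.0, 15: 16.0}
for nodes, t_parallel in timings.items():
    s = speedup(t_single, t_parallel)
    e = efficiency(t_single, t_parallel, nodes)
    print(f"p={nodes}: speedup={s:.2f}, efficiency={e:.0%}")
```

An efficiency close to 100% means the added nodes are almost fully used; it typically decreases as nodes are added for a fixed data volume.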
[Figure: efficiency (%) as a function of the number of nodes (N=3, N=6, N=15); the surviving legend entries are
S2: 2000000, S3: 3000000 and S4: 4000000.]

Figure 7. Parallelization efficiency as a function of nodes number


According to these measurements of parallel computation performance, the proposed algorithm presents
a high level of scalability, since the level of parallelism increases with the number of nodes. In addition to
this performance, the implementation under a big data platform (i.e., Spark) makes it possible to benefit fully
from many data preprocessing tools, especially for large-scale multidimensional data,
feature selection and model evaluation. Indeed, Spark provides a variety of tools for data collection, feature
extraction, selection and transformation, and data cleaning, a set of efficient algorithms for analyzing and
managing data in real time, and powerful techniques for model evaluation and selection, thanks to MLlib,
Spark's machine learning library, Spark SQL for querying large structured datasets, Spark Streaming
for processing streaming data, and Spark GraphX for handling graphs and graph-parallel computation. It is also
important to note that this improved algorithm can be easily transposed and integrated into any other big data
framework. The results of the proposed algorithm, compared to the results of other implementations and to the
classical version, show a big improvement in terms of execution time.


6. CONCLUSION AND FUTURE WORK 

In this paper, we proposed a parallel distributed algorithm based on Spark. It is a new
implementation of the posterior decoding algorithm under Spark for the hidden Markov model decoding problem.
The proposed algorithm improves the classical algorithm by using the benefits of a big data
framework (i.e., Spark). We evaluated the new algorithm, and the findings show that it solves the
decoding problem significantly faster than the classical algorithm. The obtained speedup is due to the
implementation of the new algorithm under Spark, that is, to the distribution of data over several blocks and
to parallel computation. It is worth mentioning that we only improved the time complexity, while for space
complexity, the algorithm in this paper yields the same complexity as the classical algorithm. Hence, in future
work, we will investigate how to improve the space complexity of the algorithm studied in this paper.
We also plan to study the impact of the improved algorithm in the context of big data problems in the
most promising areas. In summary, this implementation of the classical posterior decoding algorithm under Spark
reduced the time complexity and the computation time. We have thus improved
one of the most important algorithms of hidden Markov models for solving the decoding problem by leveraging
one of the most promising big data technologies, the Spark framework, in a cloud environment. Finally, this
parallel distributed posterior decoding algorithm helps to meet the needs of the current digital
revolution by providing an algorithm that is well adapted to the big data context.